[superseded] Different Hashing to avoid Collisions. #11767

timotheecour · 2019-07-17T20:49:14Z

fixes HashSet[uint64] slow insertion depending on values #11764
It turns out that all primitive types of size >= 4 bytes (e.g. int32, pointer, uint, float) are affected by the bug, resulting in a 100x to 1000x slowdown depending on conditions.
For example, for pointer the 3rd case is 1000x slower, and float behaves similarly; this PR fixes that.
It also fixes another bug that caused hash collisions for small floats (the previous code hashed x+1.0, in which a sufficiently small x is rounded away, so distinct small floats collide).
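The small-float collision can be illustrated with a short Python sketch. It is a model based only on the description above, not the code in lib/pure/hashes.nim: `old_style_float_hash` and `bitwise_float_hash` are hypothetical names, and treating the raw IEEE 754 bits as the hash value is an assumption made for illustration.

```python
import struct


def float_bits(x: float) -> int:
    """Reinterpret the 8 bytes of an IEEE 754 double as a signed 64-bit integer."""
    return struct.unpack("<q", struct.pack("<d", x))[0]


def old_style_float_hash(x: float) -> int:
    # Assumed model of the previous behaviour: hash the bits of x + 1.0.
    # For tiny x the addition rounds to exactly 1.0, so x is lost.
    return float_bits(x + 1.0)


def bitwise_float_hash(x: float) -> int:
    # Hypothetical alternative: hash the bits of x itself.
    return float_bits(x)


small_floats = [1e-20, 2e-20, 3e-20]
old_hashes = {old_style_float_hash(x) for x in small_floats}
new_hashes = {bitwise_float_hash(x) for x in small_floats}
print(len(old_hashes), len(new_hashes))  # 1 3
```

All three inputs collapse to the bit pattern of 1.0 under the `x + 1.0` model, while their own bit patterns are distinct.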

To reproduce, see the test cases and results here: https://github.com/timotheecour/vitanim/blob/master/testcases/tests/t0129b.nim

note

As I also observed (see test case), the faster hashing PR #11203 introduced a 5x slowdown (2.229572 vs 0.424859) in some cases, e.g. with uint64 or uint32 and more (3rd case, with let n = 100_000 * 10), even after my PR #11767 is applied (i.e. after the input is hashed as a string). In other words, bytewiseHashing and murmur3 are 5x faster than the multibyte hash introduced in #11203.

This is related to what I observed in #11581, with the new observation that the multibyte extension to the Jenkins hash is also affected for inputs smaller than oids, such as uint64 or even uint32. The same conclusion as in #11581 follows: we should adopt murmur3 (or at least reconsider the implementation of #11203), which always comes out the fastest. I have provided a pure Nim implementation and suggested how to make it work at CT (via a VM register callback).

[EDIT] That 2nd point is no longer observable after the latest commit, since the code now uses hashData(cast[pointer](unsafeAddr x), T.sizeof), which for some reason is implemented differently from hash*[A](aBuf: openArray[A], sPos, ePos: int), i.e. it no longer uses multibyte Jenkins. The point remains that multibyte Jenkins can still cause a 5x slowdown even for small (4-byte) inputs.
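For readers unfamiliar with the murmur3 hash recommended in this note, here is a minimal Python sketch of the public 32-bit MurmurHash3 (x86_32) algorithm. It is not the pure Nim implementation referred to above and says nothing about its performance; the function name `murmur3_32` is chosen for illustration only.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash3 (x86_32 variant) of a byte string."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    nblocks = len(data) // 4

    # Body: consume the input four bytes at a time (little-endian).
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: mix in the remaining 1 to 3 bytes, if any.
    tail = data[4 * nblocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length and force avalanche.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


# Hashing the raw little-endian bytes of a 64-bit integer key.
key_hash = murmur3_32((123456789).to_bytes(8, "little"))
```

A primitive value would be hashed by passing its raw bytes, as in the last line; the byte order used there is an assumption of this sketch.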

krux02 · 2019-07-17T23:21:52Z

Can you please explain the problem and what you did differently to fix it before you claim that you made it 1000 times faster?

timotheecour · 2019-07-17T23:56:02Z

Before this PR, the hash was hash(x) = x bitand (2^n - 1), which is a terrible hash: it ignores all high-order bits and results in lots of (trivial) collisions.
After this PR, the string hash (based on a multibyte modification of the Jenkins hash) is applied to every x with sizeof(x) >= 4.
For sizeof(x) < 4, the preexisting identity hash was faster, so the fix checks for the sizeof(x) >= 4 criterion.
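The effect of masking an identity hash can be sketched in Python. This is an illustrative model, not the Nim stdlib code: the table size (2^10 slots), the key pattern (multiples of 2^16), and the helper names are assumptions, and the byte hash is the classic Jenkins one-at-a-time function rather than the multibyte modification mentioned above.

```python
MASK32 = 0xFFFFFFFF


def identity_hash(x: int) -> int:
    # Model of the old behaviour: the key is its own hash.
    return x


def jenkins_one_at_a_time(data: bytes) -> int:
    """Classic Jenkins one-at-a-time hash over a byte string (32-bit)."""
    h = 0
    for b in data:
        h = (h + b) & MASK32
        h = (h + (h << 10)) & MASK32
        h ^= h >> 6
    h = (h + (h << 3)) & MASK32
    h ^= h >> 11
    h = (h + (h << 15)) & MASK32
    return h


def bytes_hash(x: int) -> int:
    # Hash the raw little-endian bytes of a 64-bit key.
    return jenkins_one_at_a_time(x.to_bytes(8, "little"))


def used_slots(hash_fn, keys, table_bits: int) -> set:
    """Slots occupied when each hash is masked with 2^table_bits - 1."""
    mask = (1 << table_bits) - 1
    return {hash_fn(k) & mask for k in keys}


# Keys that differ only in bits above the table mask.
keys = [i << 16 for i in range(1000)]
identity_slots = used_slots(identity_hash, keys, table_bits=10)
mixed_slots = used_slots(bytes_hash, keys, table_bits=10)
print(len(identity_slots), len(mixed_slots))
```

With the identity hash every key lands in slot 0, so each insertion has to probe past all earlier entries; the byte hash spreads the same keys over many slots.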

Araq · 2019-07-18T06:49:33Z

@narimiran is working on a Murmur3 implementation.

mratsim · 2019-07-18T06:58:31Z

Obviously we can't have fancy intrinsics in the VM but if it's not too complex CityHash or Daniel Lemire's CLHash can be considered:

narimiran · 2019-07-18T10:18:04Z

> Obviously we can't have fancy intrinsics in the VM

I've already made Murmur3 work in the VM. The only remaining problem is JS backend.

Varriount · 2019-07-18T13:19:29Z

@timotheecour Wow! Nice catch!

Araq · 2019-10-02T18:40:55Z

New hash algorithm is shipping with v1, closing.

timotheecour · 2020-02-19T21:08:47Z

superseded by #13418

fix nim-lang#11764: make sets 1000 times faster for pointer, int64, int etc

71c17f0

timotheecour mentioned this pull request Jul 17, 2019

HashSet[uint64] slow insertion depending on values #11764

Closed

timotheecour changed the title ~~fix #11764: make sets 1000 times faster for pointer, int64, int etc~~ fix #11764: make sets (tables etc) 1000 times faster for pointer, int64, int etc Jul 17, 2019

timotheecour mentioned this pull request Jul 17, 2019

regression: hashes makes tables 100x slower on some inputs, eg oids #11581

Closed

timotheecour marked this pull request as ready for review July 17, 2019 21:11

krux02 reviewed Jul 17, 2019

fix nim script test

0302ad4

timotheecour force-pushed the pr_fix_11764 branch from 814a02a to 0302ad4 Compare July 17, 2019 23:40

avoid $x string allocation; fix another bug that could lead to collisions for small floats

93b46c4

fix tests using hashBiggestIntVM vm callback

a4eb96d

timotheecour force-pushed the pr_fix_11764 branch from 7847592 to a4eb96d Compare July 18, 2019 09:10

krux02 changed the title ~~fix #11764: make sets (tables etc) 1000 times faster for pointer, int64, int etc~~ Different Hashing to avoid Collisions. Jul 18, 2019

krux02 mentioned this pull request Jul 24, 2019

[superseded] new macros.genAst: sidesteps issues with quote do #11722

Closed

timotheecour mentioned this pull request Aug 26, 2019

hashes: implement murmur3 #12022

Merged

Araq closed this Oct 2, 2019

timotheecour mentioned this pull request Feb 13, 2020

fix #13393 better hash for primitive types, avoiding catastrophic (1000x) slowdowns for certain input distributions #13410

Closed

timotheecour changed the title ~~Different Hashing to avoid Collisions.~~ [superseded] Different Hashing to avoid Collisions. Feb 19, 2020

timotheecour deleted the pr_fix_11764 branch February 19, 2020 21:08
